@FBruzzesi
Member

@FBruzzesi FBruzzesi commented Jul 26, 2025

What type of PR is this? (check all applicable)

  • 💾 Refactor
  • ✨ Feature
  • 🐛 Bug Fix
  • 🔧 Optimization
  • 📝 Documentation
  • ✅ Test
  • 🐳 Other

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

Related issue: #2722

I am slightly concerned that this could turn out to be a month long PR. I will open it as draft for now. There are a couple of pain points I already know:

  • I was very careful with type hints, yet it somehow doesn't pass the type checker (Fixed)
  • A custom check (against importing from dtypes) will fail, because Int64 is the default used in the dtype argument. In principle we could just ignore the argument completely and always opt for Int64.
  • Polars can do eager: bool. In our case it's a bit more complex than that. Instead of adding yet another argument to the function, I allowed for eager to be False (default), None or the backend/implementation that should back the series.
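
To make the last point concrete, here is a minimal sketch of an `eager` argument that is either `False` or the name of a backend. Everything in it is hypothetical (`int_range_sketch`, `LazyRange`, `EagerRange`, and the `EAGER_BACKENDS` set are stand-ins, not narwhals code), and it assumes the polars convention that a missing `end` means "count from 0 up to `start`".

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class LazyRange:
    """Hypothetical stand-in for an expression: nothing is computed yet."""

    start: int
    end: int
    step: int


@dataclass(frozen=True)
class EagerRange:
    """Hypothetical stand-in for a series backed by a named backend."""

    backend: str
    values: tuple[int, ...]


# Assumed set of eager-capable backends, for illustration only.
EAGER_BACKENDS = frozenset({"pandas", "polars", "pyarrow"})


def int_range_sketch(
    start: int,
    end: int | None = None,
    step: int = 1,
    *,
    eager: str | Literal[False] = False,
) -> LazyRange | EagerRange:
    # Assumption: with a single bound, the range runs from 0 to that bound.
    if end is None:
        start, end = 0, start
    # `False` keeps things lazy; no backend is needed at this point.
    if eager is False:
        return LazyRange(start, end, step)
    # Anything else must name a backend that can materialize the values.
    if eager not in EAGER_BACKENDS:
        msg = f"Expected an eager backend, got: {eager!r}"
        raise ValueError(msg)
    return EagerRange(eager, tuple(range(start, end, step)))
```

A single argument then answers both questions at once: whether to materialize, and with which backend.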

@FBruzzesi FBruzzesi added enhancement New feature or request pyarrow Issue is related to pyarrow backend pandas-like Issue is related to pandas-like backends polars Issue is related to polars backend labels Jul 26, 2025
@FBruzzesi FBruzzesi marked this pull request as ready for review July 27, 2025 09:17
@dangotbanned
Member

dangotbanned commented Jul 27, 2025

Hey @FBruzzesi, as I've mentioned 8347839 times, I'm very keen to see this in narwhals! 😄

Would you be okay if we let this one marinate for a bit and maybe target the release after the next? 🙏

@FBruzzesi
Member Author

Hey @FBruzzesi, as I've mentioned 8347839 times, I'm very keen to see this in narwhals! 😄

🎉

Would you be okay if we let this one marinate for a bit and maybe target the release after the next? 🙏

Yeah, I was not expecting this to be finalized by tomorrow. It has some sharp edges that need some love and work.

@FBruzzesi
Member Author

@MarcoGorelli any "trick" to avoid the custom check in precommit?

  • For compliant namespaces the import of IntegerType is in the if TYPE_CHECKING block
  • For function.py, Int64 is used as the default value (and then overwritten by the v*.dtypes.Int64 dtype). I guess one way of doing it is to let it be None and set it inside the function given the version. Yet it's a bit less explicit to a user.

I thought about emulating what utils/import_check.py does, but I would need to figure out what's happening there from scratch 😂

@dangotbanned
Member

any "trick" to avoid the custom check in precommit?

You could add this to nw.typing, and then use that everywhere instead?

IntegerDType: TypeAlias = "dtypes.IntegerType | type[dtypes.IntegerType]"
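
A rough sketch of how such an alias could be consumed, accepting either the dtype class or an instance of it. The classes below are local stand-ins rather than the real `dtypes` module, and `normalize_dtype` is a hypothetical helper. Because the alias is a string, the names inside it are only resolved by the type checker; in a real module that is what would let the dtype import live under `if TYPE_CHECKING`.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from typing import TypeAlias


class IntegerType:
    """Stand-in for `dtypes.IntegerType` (not the real class)."""


class Int64(IntegerType):
    """Stand-in for `dtypes.Int64`."""


class Int32(IntegerType):
    """Stand-in for `dtypes.Int32`."""


# A string alias: only the type checker ever evaluates the right-hand side.
IntegerDType: TypeAlias = "IntegerType | type[IntegerType]"


def normalize_dtype(dtype: IntegerDType = Int64) -> IntegerType:
    """Hypothetical helper: turn `Int64` or `Int64()` into an instance."""
    instance = dtype() if isinstance(dtype, type) else dtype
    if not isinstance(instance, IntegerType):
        msg = f"Expected an integer dtype, got: {dtype!r}"
        raise TypeError(msg)
    return instance
```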

@dangotbanned dangotbanned marked this pull request as draft August 20, 2025 17:18
@dangotbanned dangotbanned marked this pull request as ready for review August 20, 2025 17:31
@dangotbanned
Member

this is probably fine, i'd just like to take a look before merging please, thanks 🙏

@MarcoGorelli pinging again as (#2895 (comment)) was a month ago 🫣

@dangotbanned
Member

@FBruzzesi I still have hope that eventually we'll land this 🙏

I have an idea of how we could reuse int_range to implement date_range for pyarrow.

Idea

As an example, if we convert the output of pl.date_range to pyarrow, we can see the pa.DataType as:

import datetime as dt

import polars as pl
import pyarrow as pa

start, end = dt.date(2000, 1, 1), dt.date(2000, 1, 5)
expr = pl.date_range(start, end).alias("date")
table = pl.select(expr).to_arrow()
table.column("date").type

See date32

DataType(date32[day])

Since that is just represented as a 32-bit integer, we can do things like:

dates = table.column("date")
dates
Show Output

<pyarrow.lib.ChunkedArray object at 0x00000296A456BDC0>
[
  [
    2000-01-01,
    2000-01-02,
    2000-01-03,
    2000-01-04,
    2000-01-05
  ]
]

dates.cast(pa.int32())
Show Output

<pyarrow.lib.ChunkedArray object at 0x00000296A456B8E0>
[
  [
    10957,
    10958,
    10959,
    10960,
    10961
  ]
]

dates.cast(pa.int32()).cast(pa.date32())
Show Output

<pyarrow.lib.ChunkedArray object at 0x00000296A456BAC0>
[
  [
    2000-01-01,
    2000-01-02,
    2000-01-03,
    2000-01-04,
    2000-01-05
  ]
]
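
The integers above are day counts since the UNIX epoch (1970-01-01), which is how date32 stores a date. A small standard-library check of that arithmetic, with `date_to_days` and `days_to_date` as hypothetical helper names:

```python
import datetime as dt

# date32 counts whole days since the UNIX epoch.
EPOCH = dt.date(1970, 1, 1)


def date_to_days(value: dt.date) -> int:
    """Number of days between the epoch and `value`."""
    return (value - EPOCH).days


def days_to_date(days: int) -> dt.date:
    """Inverse of `date_to_days`."""
    return EPOCH + dt.timedelta(days=days)


# 30 years of 365 days plus 7 leap days (1972 through 1996) gives 10957.
print(date_to_days(dt.date(2000, 1, 1)))  # 10957
print(days_to_date(10961))  # 2000-01-05
```

So stepping a date by N days is the same as adding N to its integer representation.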

In practice

import datetime as dt
from typing import Literal

import pyarrow as pa

# Spelled out here so the snippet is self-contained
ClosedInterval = Literal["both", "left", "right", "none"]


def date_range(
    start: dt.date,
    end: dt.date,
    interval: int,  # (* assuming the `Interval` part is solved)
    *,
    closed: ClosedInterval = "both",
) -> pa.Date32Array:
    # date32 scalars cast to the number of days since 1970-01-01
    start_i = pa.scalar(start).cast(pa.int32()).as_py()
    end_i = pa.scalar(end).cast(pa.int32()).as_py()

    # call `int_range` here for the compatibility branch
    # `end` itself is only reachable when the right side is closed
    stop = end_i + 1 if closed in {"both", "right"} else end_i
    arr = pa.arange(start_i, stop, interval)
    # `start` is always the first element, so drop it when the left side is open
    if closed in {"right", "none"}:
        arr = arr.slice(1)

    # the first cast would happen in `int_range(dtype=...)`
    return arr.cast(pa.int32()).cast(pa.date32())


start, end = dt.date(2000, 1, 1), dt.date(2000, 2, 1)

date_range(start, end, interval=7, closed="none")
<pyarrow.lib.Date32Array object at 0x00000296A74176A0>
[
  2000-01-08,
  2000-01-15,
  2000-01-22,
  2000-01-29
]

What's the catch?

We'd need to adapt Interval to support a different kind of parse.

Show Interval

class Interval:
    def __init__(self, multiple: int, unit: IntervalUnit, /) -> None:
        self.multiple: int = multiple
        self.unit: IntervalUnit = unit

    def to_timedelta(
        self, *, unsupported: Container[IntervalUnit] = frozenset(("ns", "mo", "q", "y"))
    ) -> dt.timedelta:
        if self.unit in unsupported:  # pragma: no cover
            msg = f"Creating timedelta with {self.unit} unit is not supported."
            raise NotImplementedError(msg)
        kwd = UNIT_TO_TIMEDELTA[self.unit]
        # error: Keywords must be strings (bad mypy)
        return dt.timedelta(**{kwd: self.multiple})  # type: ignore[misc]

    @classmethod
    def parse(cls, every: str) -> Interval:
        multiple, unit = cls._parse(every)
        if unit == "mo" and multiple not in MONTH_MULTIPLES:
            msg = f"Only the following multiples are supported for 'mo' unit: {MONTH_MULTIPLES}.\nGot: {multiple}."
            raise ValueError(msg)
        if unit == "q" and multiple not in QUARTER_MULTIPLES:
            msg = f"Only the following multiples are supported for 'q' unit: {QUARTER_MULTIPLES}.\nGot: {multiple}."
            raise ValueError(msg)
        if unit == "y" and multiple != 1:
            msg = (
                f"Only multiple 1 is currently supported for 'y' unit.\nGot: {multiple}."
            )
            raise ValueError(msg)
        return cls(multiple, unit)

    @classmethod
    def parse_no_constraints(cls, every: str) -> Interval:
        return cls(*cls._parse(every))

    @staticmethod
    def _parse(every: str) -> tuple[int, IntervalUnit]:
        if match := PATTERN_INTERVAL.match(every):
            multiple = int(match["multiple"])
            unit = cast("IntervalUnit", match["unit"])
            return multiple, unit
        msg = (
            f"Invalid `every` string: {every}. Expected string of kind <number><unit>, "
            f"where 'unit' is one of: {get_args(IntervalUnit)}."
        )
        raise ValueError(msg)

It would be restricted to only d, w, mo, q, y - but even within that - maybe more flexible given that we know +1d is equivalent to +1 🤔?
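
As a sketch of what "a different kind of parse" could look like for the fixed-length units, the snippet below turns an interval string into an integer day step. `interval_to_days`, `PATTERN_DAYS` and `DAYS_PER_UNIT` are hypothetical names, not part of `Interval`. The mo, q and y units are deliberately left out, since a month, quarter or year is not a fixed number of days and would need calendar-aware handling.

```python
import re

# Only the units with a fixed length in days are handled in this sketch.
PATTERN_DAYS = re.compile(r"^(?P<multiple>\d+)(?P<unit>d|w)$")
DAYS_PER_UNIT = {"d": 1, "w": 7}


def interval_to_days(every: str) -> int:
    """Convert strings like "1d" or "2w" into an integer step in days."""
    match = PATTERN_DAYS.match(every)
    if match is None:
        msg = f"Cannot express {every!r} as a fixed number of days."
        raise ValueError(msg)
    return int(match["multiple"]) * DAYS_PER_UNIT[match["unit"]]


print(interval_to_days("1d"))  # 1
print(interval_to_days("2w"))  # 14
```

The resulting integer could then be passed straight through as the `interval` step of an integer range.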

Comment on lines +1816 to +1824
@unstable
def int_range(
    start: int | Expr,
    end: int | Expr | None = None,
    step: int = 1,
    *,
    dtype: IntegerDType = Int64,
    eager: IntoBackend[EagerAllowed] | Literal[False] = False,
) -> Expr | Series[Any]:
Member

Note

I don't think this blocks anything - just a realization from me 🙂

I hadn't been able to put my finger on what seemed off to me wrt this until just now:

Polars can do eager: bool. In our case it's a bit more complex than that.
Instead of adding yet another argument to the function, I allowed for eager to be False (default), None or the backend/implementation that should back the series.

This trick would work for us in all the cases where eager defines Expr | Series.
That is most of them, but there is polars.select as the ugly duckling with:

DataFrame | LazyFrame

@overload
def select(
    *exprs: IntoExpr | Iterable[IntoExpr],
    eager: Literal[True] = ...,
    **named_exprs: IntoExpr,
) -> DataFrame: ...


@overload
def select(
    *exprs: IntoExpr | Iterable[IntoExpr],
    eager: Literal[False],
    **named_exprs: IntoExpr,
) -> LazyFrame: ...


def select(
    *exprs: IntoExpr | Iterable[IntoExpr], eager: bool = True, **named_exprs: IntoExpr
) -> DataFrame | LazyFrame:

If we added select, then we'd need two arguments to be able to say whether we want pl.DataFrame or pl.LazyFrame

def select(
    *exprs: IntoExpr | Iterable[IntoExpr],
    backend: IntoBackend[Backend],
    eager: bool = True,
    **named_exprs: IntoExpr,
) -> DataFrame | LazyFrame: ...

But that also seems a bit footgun-y, since something like select(..., backend="duckdb") would have a conflicting default.
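
To spell out the conflict, here is a toy sketch of what the two-argument version would have to decide. `select_kind` and `LAZY_ONLY_BACKENDS` are hypothetical, and treating duckdb as lazy-only is the assumption that makes the example work.

```python
# Assumption for illustration: backends that cannot produce an eager frame.
LAZY_ONLY_BACKENDS = frozenset({"duckdb"})


def select_kind(backend: str, eager: bool = True) -> str:
    """Return which kind of frame a two-argument `select` would produce."""
    if eager and backend in LAZY_ONLY_BACKENDS:
        # The default `eager=True` contradicts a lazy-only backend, so the
        # caller is forced to also pass `eager=False`.
        msg = f"{backend!r} is lazy-only, pass `eager=False` explicitly."
        raise ValueError(msg)
    return "DataFrame" if eager else "LazyFrame"


print(select_kind("polars"))  # DataFrame
print(select_kind("duckdb", eager=False))  # LazyFrame
```

Calling `select_kind("duckdb")` with the default raises, which is the footgun: the natural-looking call is the one that fails.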

Labels

eager-only enhancement New feature or request

Development

Successfully merging this pull request may close these issues.

Support int_range (Eager-only)

4 participants